{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "intro_to_fairness.ipynb",
      "provenance": [],
      "collapsed_sections": [
        "J8daw3YOIAXH",
        "xFxZOg55lWJE",
        "l-K-xqksm-X3",
        "TXkkHYyJ98_k",
        "91wjnZFpPWw-",
        "KlF-lQ8yQ69b",
        "qZ-9vJgSEpHj",
        "7YVH8hYfSjer",
        "2lx4JuLdi7jw",
        "TF3B5h3c-7Fb"
      ]
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "y5T8lbpLd1sr",
        "colab": {}
      },
      "source": [
        "#@title Copyright 2020 Google LLC. Double-click here for license information.\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "84x4Fxc5lzFv"
      },
      "source": [
        "# Introduction to Fairness in ML\n",
        "***"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "J8daw3YOIAXH"
      },
      "source": [
        "## Disclaimer\n",
        "This exercise explores just a small subset of  ideas and techniques relevant to fairness in machine learning; it is not the whole story!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xFxZOg55lWJE"
      },
      "source": [
        "## Learning Objectives\n",
        "\n",
        "* Increase awareness of different types of biases that can manifest in model data.\n",
        "* Explore feature data to proactively identify potential sources of bias before training a model.\n",
        "* Evaluate model performace by subgroup rather than in aggregate."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "l-K-xqksm-X3"
      },
      "source": [
        "## Overview\n",
        "\n",
        "In this exercise, you'll explore datasets and evaluate classifiers with *fairness* in mind, noting the ways undesirable biases can creep into machine learning (ML).\n",
        "\n",
        "Throughout, you will see **FairAware** tasks, which provide opportunities to contextualize ML processes with respect to fairness. In performing these tasks, you'll identify biases and consider the long-term impact of model predictions if these biases are not addressed."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TXkkHYyJ98_k"
      },
      "source": [
        "## About the Dataset and Prediction Task\n",
        "\n",
        "In this exercise, you'll work with the [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income), which is commonly used in machine learning literature. This data was extracted from the [1994 Census bureau database](http://www.census.gov/en.html) by Ronny Kohavi and Barry Becker.\n",
        "\n",
        "Each example in the dataset contains the following demographic data for a set of individuals who took part in the 1994 Census:\n",
        "\n",
        "### Numeric Features\n",
        "*   `age`: The age of the individual in years.\n",
        "*   `fnlwgt`: The number of individuals the Census Organizations believes that set of observations represents.\n",
        "*   `education_num`:  An enumeration of the categorical representation of education. The higher the number, the higher the education that individual achieved. For example, an `education_num` of `11` represents `Assoc_voc` (associate degree at a vocational school), an `education_num` of `13` represents `Bachelors`, and an `education_num` of `9` represents `HS-grad` (high school graduate).\n",
        "*   `capital_gain`: Capital gain made by the individual, represented in US Dollars.\n",
        "*   `capital_loss`: Capital loss mabe by the individual, represented in US Dollars.\n",
        "*   `hours_per_week`: Hours worked per week.\n",
        "\n",
        "### Categorical Features\n",
        "*   `workclass`: The individual's type of employer. Examples include: `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, and `Never-worked`.\n",
        "*   `education`: The highest level of education achieved for that individual.\n",
        "*   `marital_status`: Marital status of the individual. Examples include: `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, and `Married-AF-spouse`.\n",
        "*   `occupation`: The occupation of the individual. Example include: `tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial` and more.\n",
        "*   `relationship`:  The relationship of each individual in a household. Examples include: `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, and `Unmarried`.\n",
        "*   `gender`:  Gender of the individual available only in binary choices: `Female` or `Male`.\n",
        "*   `race`: `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Black`, and `Other`. \n",
        "*   `native_country`: Country of origin of the individual. Examples include: `United-States`, `Cambodia`, `England`, `Puerto-Rico`, `Canada`, `Germany`, `Outlying-US(Guam-USVI-etc)`, `India`, `Japan`, and more.\n",
        "\n",
        "### Prediction Task\n",
        "The prediction task is to **determine whether a person makes over $50,000 US Dollar a year.**\n",
        "\n",
        "### Label\n",
        "*   `income_bracket`: Whether the person makes more than $50,000 US Dollars annually.\n",
        "\n",
        "### Notes on Data Collection\n",
        "\n",
        "All the examples extracted for this dataset meet the following conditions: \n",
        "*   `age` is 16 years or older.\n",
        "*   The adjusted gross income (used to calculate `income_bracket`) is greater than $100 USD annually.\n",
        "*   `fnlwgt` is greater than 0.\n",
        "*   `hours_per_week` is greater than 0.\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "I0RMIktKy8xX"
      },
      "source": [
        "## Setup\n",
        "\n",
        "First, we should ensure that this Colaboratory notebook will run on TensorFlow 2.x."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "MelAK2u6d-xx",
        "colab": {}
      },
      "source": [
        "#@title Run on TensorFlow 2.x\n",
        "%tensorflow_version 2.x\n",
        "from __future__ import absolute_import, division, print_function, unicode_literals"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "jUsgiVsUeKRR"
      },
      "source": [
        "Next, we'll import the necessary modules to run the code in the rest of this Colaboratory notebook. \n",
        "\n",
        "In addition to importing the usual libraries, this setup code cell also installs [Facets](https://pair-code.github.io/facets/), an open-source tool created by [PAIR](https://research.google/teams/brain/pair/) that contains two robust visualizations we'll be using to aid in understanding and analyzing ML datasets."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "2e_0DJJ8zE29",
        "colab": {}
      },
      "source": [
        "#@title Import revelant modules and install Facets\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "import tensorflow as tf\n",
        "from tensorflow.keras import layers\n",
        "from matplotlib import pyplot as plt\n",
        "from matplotlib import rcParams\n",
        "import seaborn as sns\n",
        "\n",
        "# The following lines adjust the granularity of reporting. \n",
        "pd.options.display.max_rows = 10\n",
        "pd.options.display.float_format = \"{:.1f}\".format\n",
        "\n",
        "from google.colab import widgets\n",
        "# For facets\n",
        "from IPython.core.display import display, HTML\n",
        "import base64\n",
        "!pip install facets-overview==1.0.0\n",
        "from facets_overview.feature_statistics_generator import FeatureStatisticsGenerator"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "-xgIRapb5LaQ"
      },
      "source": [
        "### Load the Adult Dataset\n",
        "\n",
        "With the modules now imported, we can load the Adult dataset into a pandas DataFrame data structure."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "TeCNVvVUVS0P",
        "colab": {}
      },
      "source": [
        "COLUMNS = [\"age\", \"workclass\", \"fnlwgt\", \"education\", \"education_num\",\n",
        "           \"marital_status\", \"occupation\", \"relationship\", \"race\", \"gender\",\n",
        "           \"capital_gain\", \"capital_loss\", \"hours_per_week\", \"native_country\",\n",
        "           \"income_bracket\"]\n",
        "\n",
        "train_csv = tf.keras.utils.get_file('adult.data', \n",
        "  'https://download.mlcc.google.com/mledu-datasets/adult_census_train.csv')\n",
        "test_csv = tf.keras.utils.get_file('adult.data', \n",
        "  'https://download.mlcc.google.com/mledu-datasets/adult_census_test.csv')\n",
        "\n",
        "train_df = pd.read_csv(train_csv, names=COLUMNS, sep=r'\\s*,\\s*', \n",
        "                       engine='python', na_values=\"?\")\n",
        "test_df = pd.read_csv(test_csv, names=COLUMNS, sep=r'\\s*,\\s*', skiprows=[0],\n",
        "                      engine='python', na_values=\"?\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "coilRN-hooja"
      },
      "source": [
        "## Analyzing the Adult Dataset with Facets\n",
        "\n",
        "As mentioned in MLCC, it is important to understand your dataset *before* diving straight into the prediction task. \n",
        "\n",
        "Some important questions to investigate when auditing a dataset for fairness:\n",
        "\n",
        "* **Are there missing feature values for a large number of observations?**\n",
        "* **Are there features that are missing that might affect other features?**\n",
        "* **Are there any unexpected feature values?**\n",
        "* **What signs of data skew do you see?**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9yCIuAqWA1Pm"
      },
      "source": [
        "To start, we can use [Facets Overview](https://pair-code.github.io/facets/), an interactive visualization tool that can help us explore the dataset. With Facets Overview, we can quickly analyze the distribution of values across the Adult dataset."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "MW-qryqs1gig",
        "colab": {}
      },
      "source": [
        "#@title Visualize the Data in Facets\n",
        "fsg = FeatureStatisticsGenerator()\n",
        "dataframes = [\n",
        "    {'table': train_df, 'name': 'trainData'}]\n",
        "censusProto = fsg.ProtoFromDataFrames(dataframes)\n",
        "protostr = base64.b64encode(censusProto.SerializeToString()).decode(\"utf-8\")\n",
        "\n",
        "\n",
        "HTML_TEMPLATE = \"\"\"<script src=\"https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js\"></script>\n",
        "        <link rel=\"import\" href=\"https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html\">\n",
        "        <facets-overview id=\"elem\"></facets-overview>\n",
        "        <script>\n",
        "          document.querySelector(\"#elem\").protoInput = \"{protostr}\";\n",
        "        </script>\"\"\"\n",
        "html = HTML_TEMPLATE.format(protostr=protostr)\n",
        "display(HTML(html))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "91wjnZFpPWw-"
      },
      "source": [
        "### FairAware Task #1\n",
        "\n",
        "Review the descriptive statistics and histograms for each numerical and continuous feature. Click the **Show Raw Data** button above the histograms for categorical features to see the distribution of values per category.\n",
        "\n",
        "Then, try to answer the following questions from earlier:\n",
        "\n",
        "1. Are there missing feature values for a large number of observations?\n",
        "2. Are there features that are missing that might affect other features?\n",
        "3. Are there any unexpected feature values?\n",
        "4. What signs of data skew do you see?"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "KlF-lQ8yQ69b"
      },
      "source": [
        "### Solution\n",
        "\n",
        "Click below for some insights we uncovered."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xX_qjj5AQ_Hb"
      },
      "source": [
        "We can see from reviewing the **missing** column that the following categorical features contain missing values:\n",
        "\n",
        "*   workclass\n",
        "*   occupation\n",
        "\n",
        "Now, because it's only a small percentage of samples that contain either a missing workclass value or occupation value, we can safely drop those rows from the data set. If that percentage was much higher, then we would have to consider using a different data set that is more complete. \n",
        "\n",
        "Luckily, in Pandas, there is a convenient way to drop any row containing a missing value in the data set:\n",
        "\n",
        "```\n",
        "# pandas.DataFrame.dropna(how=\"any\", axis=0, inplace=True)\n",
        "```\n",
        "We will use this method prior to training the model when we convert a Pandas DataFrame to a Numpy array.\n",
        "\n",
        "As for the remaining data that does not contain any missing values: if we look at the min/max values and histograms for each numeric feature, then we can pinpoint any extreme outliers in our data set. \n",
        "\n",
        "For `hours_per_week`, we can see that the minimum is 1, which might be a bit surprising, given that most jobs typically require multiple hours of work per week. For `capital_gain` and `capital_loss`, we can see that over 90% of values are 0. Given that capital gains/losses are only registered by individuals who make investments, it's certainly plausible that less than 10% of examples would have nonzero values for these feature, but we may want to take a closer look to verify the values for these features are valid.\n",
        "\n",
        "In looking at the histogram for gender, we see that over two-thirds (approximately 67%) of examples represent males. This strongly suggests data skew, as we would expect the breakdown between genders to be closer to 50/50."
      ]
    },
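    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The checks described above can also be sketched directly in pandas. The snippet below is a minimal illustration and not part of the original exercise: it builds a small, made-up DataFrame (`toy_df`, a hypothetical stand-in for `train_df`) and shows one way to measure the fraction of missing values per column, inspect the gender split, and drop incomplete rows.\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "import pandas as pd\n",
        "\n",
        "# Made-up rows for illustration only; in the notebook you would use train_df.\n",
        "toy_df = pd.DataFrame({\n",
        "    'workclass': ['Private', np.nan, 'State-gov', 'Private'],\n",
        "    'occupation': ['Sales', np.nan, 'Adm-clerical', 'Tech-support'],\n",
        "    'gender': ['Male', 'Female', 'Male', 'Male'],\n",
        "})\n",
        "\n",
        "# Fraction of missing values in each column.\n",
        "missing_fraction = toy_df.isna().mean()\n",
        "print(missing_fraction)\n",
        "\n",
        "# Share of examples belonging to each gender value.\n",
        "gender_share = toy_df['gender'].value_counts(normalize=True)\n",
        "print(gender_share)\n",
        "\n",
        "# Drop every row that contains at least one missing value.\n",
        "clean_df = toy_df.dropna(how='any', axis=0)\n",
        "print(len(toy_df), 'rows before,', len(clean_df), 'rows after')\n",
        "```\n",
        "\n",
        "Applying the same calls to `train_df` should give numbers comparable to what Facets Overview displays, which makes it a convenient cross-check."
      ]
    },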
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hKj2hz-Sql7V"
      },
      "source": [
        "### A Deeper Dive\n",
        "\n",
        "To futher explore the dataset, we can use [Facets Dive](https://pair-code.github.io/facets/), a tool that provides an interactive interface where each individual item in the visualization represents a data point. But to use Facets Dive, we need to convert the data to a JSON array.\n",
        "Thankfully the DataFrame method `to_json()` takes care of this for us.\n",
        "\n",
        "Run the cell below to perform the data transform to JSON and also load Facets Dive. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "stlklrG_xssF",
        "colab": {}
      },
      "source": [
        "#@title Set the Number of Data Points to Visualize in Facets Dive\n",
        "\n",
        "SAMPLE_SIZE = 5000 #@param\n",
        "  \n",
        "train_dive = train_df.sample(SAMPLE_SIZE).to_json(orient='records')\n",
        "\n",
        "HTML_TEMPLATE = \"\"\"<script src=\"https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js\"></script>\n",
        "        <link rel=\"import\" href=\"https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html\">\n",
        "        <facets-dive id=\"elem\" height=\"600\"></facets-dive>\n",
        "        <script>\n",
        "          var data = {jsonstr};\n",
        "          document.querySelector(\"#elem\").data = data;\n",
        "        </script>\"\"\"\n",
        "html = HTML_TEMPLATE.format(jsonstr=train_dive)\n",
        "display(HTML(html))"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "LxqAPDcRDFB2"
      },
      "source": [
        "## FairAware Task #2\n",
        "\n",
        "Use the menus on the left panel of the visualization to change how the data is organized:\n",
        "\n",
        "1. In the **Binning | X-Axis** menu, select **education**, and in the **Color By** and  **Label By** menus, select **income_bracket**. How would you describe the relationship between education level and income bracket?\n",
        "\n",
        "2. Next, in the **Binning | X-Axis** menu, select  **marital_status**, and in the **Color By** and  **Label By** menus, select **gender**. What noteworthy observations can you make about the gender distributions for each marital-status category?\n",
        "\n",
        "As you perform the above tasks, keep the following fairness-related questions in mind:\n",
        "\n",
        "* **What's missing?**\n",
        "* **What's being overgeneralized?**\n",
        "* **What's being underrepresented?**\n",
        "* **How do the variables, and their values, reflect the real world?**\n",
        "* **What might we be leaving out?**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "qZ-9vJgSEpHj"
      },
      "source": [
        "### Solution\n",
        "\n",
        "Click below for some insights we uncovered."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "uYpbgdATEx8L"
      },
      "source": [
        "1. In the data set, higher education levels generally tend to correlate with a higher income bracket. An income level of greater than $50,000 is more heavily represented in examples where education level is Bachelor's degree or higher.\n",
        "\n",
        "2. In most marital-status categories, the distribution of male vs. female values is close to 1:1. The one notable exception is \"married-civ-spouse\", where male outnumbers female by more than 5:1. Given that we already discovered in Task #1 that there is a disproportionately high representation of men in the data set, we can now infer that it's married women specifically that are underrepresented in the data."
      ]
    },
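    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "The gender breakdown per marital-status category can also be tabulated without Facets Dive. The sketch below is an illustration only: `toy_df` is a hypothetical, made-up stand-in for `train_df`, and `pd.crosstab` is just one of several ways to build such a table.\n",
        "\n",
        "```python\n",
        "import pandas as pd\n",
        "\n",
        "# Made-up rows for illustration only; in the notebook you would use train_df.\n",
        "toy_df = pd.DataFrame({\n",
        "    'marital_status': ['Married-civ-spouse', 'Married-civ-spouse',\n",
        "                       'Married-civ-spouse', 'Never-married',\n",
        "                       'Never-married', 'Divorced'],\n",
        "    'gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],\n",
        "})\n",
        "\n",
        "# Count examples for each (marital_status, gender) combination.\n",
        "counts = pd.crosstab(toy_df['marital_status'], toy_df['gender'])\n",
        "print(counts)\n",
        "\n",
        "# Per-row proportions make categories of different sizes comparable.\n",
        "shares = pd.crosstab(toy_df['marital_status'], toy_df['gender'],\n",
        "                     normalize='index')\n",
        "print(shares)\n",
        "```\n",
        "\n",
        "With `normalize='index'`, each row sums to 1, so a heavily skewed category stands out even when its absolute counts are small."
      ]
    },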
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "7YVH8hYfSjer"
      },
      "source": [
        "### Summary\n",
        "\n",
        "Plotting histograms, ranking most-to-least common examples, identifying duplicate or missing examples, making sure the training and test sets are similar, computing feature quantiles—**these are all critical analyses to perform on your data.** \n",
        "\n",
        "**The better you know what's going on in your data, the more insight you'll have as to where unfairness might creep in!**"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2ivWw9Wpj67m"
      },
      "source": [
        "### FairAware Task #3\n",
        "\n",
        "Now that you've explored the dataset using Facets, see if you can identify some of the problems that may arise with regard to fairness based on what you've learned about its features.\n",
        "\n",
        "Which of the following features might pose a problem with regard to fairness?\n",
        "\n",
        "Choose a feature from the drop-down options in the cell below, and then run the cell to check your answer. Then explore the rest of the options to get more insight about how each influences the model's predictions."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "8bFDVCV1sxiX",
        "colab": {}
      },
      "source": [
        "feature = 'fnlwgt' #@param [\"\", \"hours_per_week\", \"fnlwgt\", \"gender\", \"capital_gain / capital_loss\", \"age\"] {allow-input: false}\n",
        "\n",
        "\n",
        "if feature == \"hours_per_week\":\n",
        "  print(\n",
        "'''It does seem a little strange to see 'hours_per_week' max out at 99 hours,\n",
        "which could lead to data misrepresentation. One way to address this is by\n",
        "representing 'hours_per_week' as a binary \"working 40 hours/not working 40\n",
        "hours\" feature. Also keep in mind that data was extracted based on work hours\n",
        "being greater than 0. In other words, this feature representation exclude a\n",
        "subpopulation of the US that is not working. This could skew the outcomes of the\n",
        "model.''')\n",
        "if feature == \"fnlwgt\":\n",
        "  print(\n",
        "\"\"\"'fnlwgt' represents the weight of the observations. After fitting the model\n",
        "to this data set, if certain group of individuals end up performing poorly \n",
        "compared to other groups, then we could explore ways of reweighting each data \n",
        "point using this feature.\"\"\")\n",
        "if feature == \"gender\":\n",
        "  print(\n",
        "\"\"\"Looking at the ratio between men and women shows how disproportionate the data\n",
        "is compared to the real world where the ratio (at least in the US) is closer to\n",
        "1:1. This could pose a huge probem in performance across gender. Considerable\n",
        "measures may need to be taken to upsample the underrepresented group (in this\n",
        "case, women).\"\"\")\n",
        "if feature == \"capital_gain / capital_loss\":\n",
        "  print(\n",
        "\"\"\"As alluded to in Task #1, both 'capital_gain' and 'capital_loss' could be \n",
        "indicative of income status as only individuals who make investments register \n",
        "their capital gains and losses. The caveat is that over 90% of the values in \n",
        "both 'capital_gain' and 'capital_loss' are 0, and it's not entirely clear from \n",
        "the description of the data set why that is the case. That is, we don't know \n",
        "whether we should interpret all these 0s as \"no investment gain/loss or \"\n",
        "investment gain/loss is unknown.\" Lack of context is always a flag for concern, \n",
        "and one that could trigger fairness-related issues later on. For now, we are \n",
        "going to omit these features from the model, but you are more than welcome to \n",
        "experiment with them if you come up with an idea on how capital gains and \n",
        "losses should be handled.\"\"\")\n",
        "if feature == \"age\":\n",
        "  print(\n",
        "'''\"age\" has a lot of variance, so it might benefit from bucketing to learn\n",
        "fine-grained correlations between income and age, as well as to prevent\n",
        "overfitting.''')\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "n3OT-YVpftEI"
      },
      "source": [
        "## Predicting income using the Keras API \n",
        "\n",
        "Now that we have a better sense of the Adult dataset, we can now begin with creating a neural network to predict income. In this section, as with previous exercises, we will be using TensorFlow's Keras API (specifically, `tf.keras.Sequential`) to construct our neural network model."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ECBRATBVG4rn"
      },
      "source": [
        "### Convert Adult Dataset into Tensors\n",
        "We first have to define our input fuction, which will take the Adult dataset that is in a pandas DataFrame and convert it a Numpy array. \n",
        "\n",
        "While a pandas DataFrame is great — especially when working with Facets and other Python modules that visualize data — `tf.keras.Sequential` doesn't accept a pandas DataFrame as a data type. Luckily for us, it's quite trivial to convert a pandas DataFrame into a Numpy array, which is an accepted data type."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "Bt-rQvJLx4Hm",
        "colab": {}
      },
      "source": [
        "def pandas_to_numpy(data):\n",
        "  '''Convert a pandas DataFrame into a Numpy array'''\n",
        "  # Drop empty rows.\n",
        "  data = data.dropna(how=\"any\", axis=0)\n",
        "\n",
        "  # Separate DataFrame into two Numpy arrays\"\n",
        "  labels = np.array(data['income_bracket'] == \">50K\")\n",
        "  features = data.drop('income_bracket', axis=1)\n",
        "  features = {name:np.array(value) for name, value in features.items()}\n",
        "  \n",
        "  return features, labels"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "0mz2sts6IjBO"
      },
      "source": [
        "### Represent Features in TensorFlow\n",
        "TensorFlow requires that data maps to a model. To accomplish this, you have to use ```tf.feature_columns``` to ingest and represent features in TensorFlow."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "tAG5hUJwx725",
        "colab": {}
      },
      "source": [
        "#@title Create categorical feature columns\n",
        "\n",
        "# Since we don't know the full range of possible values with occupation and\n",
        "# native_country, we'll use categorical_column_with_hash_bucket() to help map\n",
        "# each feature string into an integer ID.\n",
        "occupation = tf.feature_column.categorical_column_with_hash_bucket(\n",
        "    \"occupation\", hash_bucket_size=1000)\n",
        "native_country = tf.feature_column.categorical_column_with_hash_bucket(\n",
        "    \"native_country\", hash_bucket_size=1000)\n",
        "\n",
        "# For the remaining categorical features, since we know what the possible values\n",
        "# are, we can be more explicit and use categorical_column_with_vocabulary_list()\n",
        "gender = tf.feature_column.categorical_column_with_vocabulary_list(\n",
        "    \"gender\", [\"Female\", \"Male\"])\n",
        "race = tf.feature_column.categorical_column_with_vocabulary_list(\n",
        "    \"race\", [\n",
        "        \"White\", \"Asian-Pac-Islander\", \"Amer-Indian-Eskimo\", \"Other\", \"Black\"\n",
        "    ])\n",
        "education = tf.feature_column.categorical_column_with_vocabulary_list(\n",
        "    \"education\", [\n",
        "        \"Bachelors\", \"HS-grad\", \"11th\", \"Masters\", \"9th\",\n",
        "        \"Some-college\", \"Assoc-acdm\", \"Assoc-voc\", \"7th-8th\",\n",
        "        \"Doctorate\", \"Prof-school\", \"5th-6th\", \"10th\", \"1st-4th\",\n",
        "        \"Preschool\", \"12th\"\n",
        "    ])\n",
        "marital_status = tf.feature_column.categorical_column_with_vocabulary_list(\n",
        "    \"marital_status\", [\n",
        "        \"Married-civ-spouse\", \"Divorced\", \"Married-spouse-absent\",\n",
        "        \"Never-married\", \"Separated\", \"Married-AF-spouse\", \"Widowed\"\n",
        "    ])\n",
        "relationship = tf.feature_column.categorical_column_with_vocabulary_list(\n",
        "    \"relationship\", [\n",
        "        \"Husband\", \"Not-in-family\", \"Wife\", \"Own-child\", \"Unmarried\",\n",
        "        \"Other-relative\"\n",
        "    ])\n",
        "workclass = tf.feature_column.categorical_column_with_vocabulary_list(\n",
        "    \"workclass\", [\n",
        "        \"Self-emp-not-inc\", \"Private\", \"State-gov\", \"Federal-gov\",\n",
        "        \"Local-gov\", \"?\", \"Self-emp-inc\", \"Without-pay\", \"Never-worked\"\n",
        "    ])"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "Jwtuu8MmyKCJ",
        "colab": {}
      },
      "source": [
        "#@title Create numeric feature columns\n",
        "# For Numeric features, we can just call on feature_column.numeric_column()\n",
        "# to use its raw value instead of having to create a map between value and ID.\n",
        "age = tf.feature_column.numeric_column(\"age\")\n",
        "fnlwgt = tf.feature_column.numeric_column(\"fnlwgt\")\n",
        "education_num = tf.feature_column.numeric_column(\"education_num\")\n",
        "capital_gain = tf.feature_column.numeric_column(\"capital_gain\")\n",
        "capital_loss = tf.feature_column.numeric_column(\"capital_loss\")\n",
        "hours_per_week = tf.feature_column.numeric_column(\"hours_per_week\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3WqAbug6jePb"
      },
      "source": [
        "#### Make Age a Categorical Feature\n",
        "\n",
        "If you chose `age` when completing **FairAware Task #3**, you will have noticed that we suggested *bucketing* (also known as *binning*) this feature, grouping together similar ages into different groups. This might help the model generalize better across age. As such, we will convert `age` from a numeric feature (technically, an [ordinal feature](https://en.wikipedia.org/wiki/Ordinal_data)) to a categorical feature."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "HxVm8X15yLR7",
        "colab": {}
      },
      "source": [
        "age_buckets = tf.feature_column.bucketized_column(\n",
        "    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])"
      ],
      "execution_count": 0,
      "outputs": []
    },
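    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a rough illustration of how these boundaries partition ages, the following sketch reproduces the bucket assignment with plain NumPy. It assumes `np.digitize` as a stand-in for `bucketized_column` (both treat each bucket as including its left boundary and excluding its right one), and `sample_ages` is a made-up array used only for illustration."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import numpy as np\n",
        "\n",
        "# Same boundaries as the age_buckets column defined above.\n",
        "boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]\n",
        "\n",
        "# Hypothetical ages, chosen only to show the mapping at the boundaries.\n",
        "sample_ages = np.array([17, 18, 24, 25, 39, 64, 65, 90])\n",
        "\n",
        "# Bucket i holds ages in [boundaries[i-1], boundaries[i]). Bucket 0 is\n",
        "# everything below 18, and the last bucket (10) is everything 65 and above,\n",
        "# so 10 boundaries produce 11 buckets.\n",
        "bucket_ids = np.digitize(sample_ages, boundaries)\n",
        "\n",
        "for sample_age, bucket_id in zip(sample_ages, bucket_ids):\n",
        "  print('age', sample_age, '-> bucket', bucket_id)"
      ],
      "execution_count": 0,
      "outputs": []
    },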
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "2lx4JuLdi7jw"
      },
      "source": [
        "#### Consider Key Subgroups\n",
        "\n",
        "When performing feature engineering, it's important to keep in mind that you may be working with data drawn from individuals belonging to subgroups, for which you'll want to evaluate model performance separately.\n",
        "\n",
        "**_NOTE:_** *In this context, a subgroup is defined as a group of individuals who share a given characteristic—such as race, gender, or sexual orientation—that merits special consideration when evaluating a model with fairness in mind.*\n",
        "\n",
        "When we want our models to mitigate, or leverage, the learned signal of a characteristic pertaining to a subgroup, we will want to use different kinds of tools and techniques—**most of which are still actively being researched and developed**. You can find a list of related research work and techniques at our [Responsible AI Practices](https://ai.google/responsibilities/responsible-ai-practices/?category=fairness) page.\n",
        "\n",
        "As you work with different variables and define tasks for them, it can be useful to think about what comes next. For example, *where are the places where the interaction of the variable and the task could be a concern?*"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "5aD1OM8egad9"
      },
      "source": [
        "### Define the Model Features\n",
        "\n",
        "Now we can explicitly define which feature we will include in our model.\n",
        "\n",
        "We'll consider `gender` a subgroup and save it in a separate `subgroup_variables` list, so we can add special handling for it as needed."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "id": "O68xV_24gbnD",
        "colab": {}
      },
      "source": [
        "# List of variables, with special handling for gender subgroup.\n",
        "variables = [native_country, education, occupation, workclass, \n",
        "             relationship, age_buckets]\n",
        "subgroup_variables = [gender]\n",
        "feature_columns = variables + subgroup_variables"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3nYSMg67jWaA"
      },
      "source": [
        "### Train a Deep Neural Net Model on Adult Dataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "_kRL5rScH1F7"
      },
      "source": [
        "With the features now ready to go, we can try predicting income using deep learning.\n",
        "\n",
        "For the sake of simplicity, we are going to keep the neural network architecture light by simply **defining a feed-forward neural network with two hidden layers**.\n",
        "\n",
        "But first, we have to convert our high-dimensional categorical features into a low-dimensional and dense real-valued vector, which we call an embedding vector. Luckily, ```indicator_column``` (think of it as one-hot encoding) and ```embedding_column``` (that converts sparse features into dense features) helps us streamline the process.\n",
        "\n",
        "Based on our analysis of the data set from previous FairAware Tasks, we are going to move forward with the following features:\n",
        "\n",
        "*   `workclass`\n",
        "*   `education`\n",
        "*   `age_buckets`\n",
        "*   `relationship`\n",
        "*   `native_country`\n",
        "*   `occupation`\n",
        "\n",
        "All other features will be omitted from training — but you are welcome to experiment. `gender` is the only feature that will be used to filter the test set for subgroup evaluation purposes.\n",
        "\n",
        "The following cell creates the deep columns required to define the input layer of the model:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "code",
        "colab_type": "code",
        "id": "bnyw4cyLTSUB",
        "colab": {}
      },
      "source": [
        "deep_columns = [\n",
        "    tf.feature_column.indicator_column(workclass),\n",
        "    tf.feature_column.indicator_column(education),\n",
        "    tf.feature_column.indicator_column(age_buckets),\n",
        "    tf.feature_column.indicator_column(relationship),\n",
        "    tf.feature_column.embedding_column(native_country, dimension=8),\n",
        "    tf.feature_column.embedding_column(occupation, dimension=8),\n",
        "]"
      ],
      "execution_count": 0,
      "outputs": []
    },
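    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "To illustrate the difference between the two encodings, here is a minimal NumPy sketch, not the TensorFlow implementation. The vocabulary `vocab` and the random `embedding_table` are assumptions made for illustration; in the model, the embedding table is a trainable weight learned along with the rest of the network."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import numpy as np\n",
        "\n",
        "# Hypothetical vocabulary; the real columns use the vocabularies defined earlier.\n",
        "vocab = ['Private', 'State-gov', 'Federal-gov', 'Local-gov']\n",
        "index = vocab.index('Federal-gov')\n",
        "\n",
        "# Indicator (one-hot) encoding: one slot per vocabulary entry.\n",
        "one_hot = np.eye(len(vocab))[index]\n",
        "\n",
        "# Embedding: a (vocabulary size x dimension) table. It is random here, whereas\n",
        "# the model learns these values during training.\n",
        "rng = np.random.default_rng(0)\n",
        "embedding_table = rng.normal(size=(len(vocab), 8))\n",
        "dense_vector = embedding_table[index]\n",
        "\n",
        "# Looking up a row is equivalent to multiplying the one-hot vector by the table.\n",
        "assert np.allclose(one_hot @ embedding_table, dense_vector)\n",
        "\n",
        "print('one-hot shape:', one_hot.shape)\n",
        "print('embedding shape:', dense_vector.shape)"
      ],
      "execution_count": 0,
      "outputs": []
    },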
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "lBaCn_Z1PshC"
      },
      "source": [
        "With all the data preprocessing taken care of, we can now define and compile the deep neural net model. Start by using the parameters defined below. (Later on, after you've defined evaluation metrics and evaluated the model, you can come back and tweak these parameters to compare results.)\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "tQZ1kumWk8XO",
        "colab": {}
      },
      "source": [
        "#@title Define Deep Neural Net Model\n",
        "\n",
        "# Parameters from form fill-ins\n",
        "HIDDEN_UNITS_LAYER_01 = 128 #@param\n",
        "HIDDEN_UNITS_LAYER_02 = 64 #@param\n",
        "LEARNING_RATE = 0.1 #@param\n",
        "L1_REGULARIZATION_STRENGTH = 0.001 #@param\n",
        "L2_REGULARIZATION_STRENGTH = 0.001 #@param\n",
        "\n",
        "RANDOM_SEED = 512\n",
        "tf.random.set_seed(RANDOM_SEED)\n",
        "\n",
        "# List of built-in metrics that we'll need to evaluate performance.\n",
        "METRICS = [\n",
        "  tf.keras.metrics.TruePositives(name='tp'),\n",
        "  tf.keras.metrics.FalsePositives(name='fp'),\n",
        "  tf.keras.metrics.TrueNegatives(name='tn'),\n",
        "  tf.keras.metrics.FalseNegatives(name='fn'), \n",
        "  tf.keras.metrics.BinaryAccuracy(name='accuracy'),\n",
        "  tf.keras.metrics.Precision(name='precision'),\n",
        "  tf.keras.metrics.Recall(name='recall'),\n",
        "  tf.keras.metrics.AUC(name='auc'),\n",
        "]\n",
        "\n",
        "regularizer = tf.keras.regularizers.l1_l2(\n",
        "    l1=L1_REGULARIZATION_STRENGTH, l2=L2_REGULARIZATION_STRENGTH)\n",
        "\n",
        "model = tf.keras.Sequential([\n",
        "  layers.DenseFeatures(deep_columns),\n",
        "  layers.Dense(\n",
        "      HIDDEN_UNITS_LAYER_01, activation='relu', kernel_regularizer=regularizer),\n",
        "  layers.Dense(\n",
        "      HIDDEN_UNITS_LAYER_02, activation='relu', kernel_regularizer=regularizer),\n",
        "  layers.Dense(\n",
        "      1, activation='sigmoid', kernel_regularizer=regularizer)\n",
        "])\n",
        "\n",
        "model.compile(optimizer=tf.keras.optimizers.Adagrad(LEARNING_RATE),  \n",
        "              loss=tf.keras.losses.BinaryCrossentropy(),\n",
        "              metrics=METRICS)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Tjhqo9XOP2VV"
      },
      "source": [
        "To keep things simple, we'll pass through the full training data 10 times."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "UtrhAXwvqtVD",
        "colab": {}
      },
      "source": [
        "#@title Fit Deep Neural Net Model to the Adult Training Dataset\n",
        "\n",
        "EPOCHS = 10 #@param\n",
        "BATCH_SIZE = 500 #@param\n",
        "\n",
        "features, labels = pandas_to_numpy(train_df)\n",
        "model.fit(x=features, y=labels, epochs=EPOCHS, batch_size=BATCH_SIZE)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "m0UHu5t-P7G7"
      },
      "source": [
        "We can now evalute the overall model's performance using the test set."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "HDV8hYqvncCy",
        "colab": {}
      },
      "source": [
        "#@title Evaluate Deep Neural Net Performance\n",
        "\n",
        "features, labels = pandas_to_numpy(test_df)\n",
        "model.evaluate(x=features, y=labels);"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "7j0LrXMGlTDl"
      },
      "source": [
        "You can try retraining the model using different parameters. If you leave the parameters as is, then you see that this relatively simple deep neural net does a decent job in predicting income with an **overall accuracy of 0.8317** and an **AUC of 0.8817**. \n",
        "\n",
        "**But evaluation metrics with respect to subgroups are missing.** We will cover some of the ways you can evaluate at the subgroup level in the next section."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "sbwmbnUUU1kY"
      },
      "source": [
        "## Evaluating for Fairness Using a Confusion Matrix\n",
        "\n",
        "While evaluating the overall performance of the model gives us some insight into its quality, it doesn't give us much insight into how well our model performs for different subgroups.  \n",
        "\n",
        "When evaluating a model for fairness, it's important to determine whether prediction errors are uniform across subgroups or whether certain subgroups are more susceptible to certain prediction errors than others. \n",
        "\n",
        "A key tool for comparing the prevalence of different types of model errors is a *confusion matrix*. Recall from the [Classification module of Machine Learning Crash Course](https://developers.google.com/machine-learning/crash-course/classification/true-false-positive-negative) that a confusion matrix is a grid that plots predictions vs. ground truth for your model, and tabulates statistics summarizing how often your model made the correct prediction and how often it made the wrong prediction. \n",
        "\n",
        "Let's start by creating a binary confusion matrix for our income-prediction model—binary because our label (`income_bracket`) has only two possible values (`<50K` or `>50K`). We'll define an income of `>50K` as our **positive label**, and an income of `<50k` as our **negative label**.\n",
        "\n",
        "**NOTE:** *Positive* and *negative* in this context should not be interpreted as value judgments (we are not suggesting that someone who earns more than 50k a year is a better person than someone who earns less than 50k). They are just standard terms used to distinguish between the two possible predictions the model can make.\n",
        "\n",
        "Cases where the model makes the correct prediction (the prediction matches the ground truth) are classified as **true**, and cases where the model makes the wrong prediction are classified as **false**.\n",
        "\n",
        "Our confusion matrix thus represents four possible states:\n",
        "\n",
        "* **true positive**: Model predicts `>50K`, and that is the ground truth.\n",
        "* **true negative**: Model predicts `<50K`, and that is the ground truth.\n",
        "* **false positive**: Model predicts `>50K`, and that contradicts reality.\n",
        "* **false negative**: Model predicts `<50K`, and that contradicts reality.\n",
        "\n",
        "**NOTE:** If desired, we can use the number of outcomes in each of these states to calculate secondary evaluation metrics, such as [precision and recall](https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall)."
      ]
    },
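    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text"
      },
      "source": [
        "As a small sketch of how those secondary metrics follow from the four counts, the cell below derives accuracy, precision, recall, and the false positive and false negative rates. The helper name `rates_from_counts` and the example counts are hypothetical and are not taken from the model in this notebook."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def rates_from_counts(tp, fp, tn, fn):\n",
        "  # Derives common rates from the four confusion-matrix counts.\n",
        "  # Assumes every denominator is non-zero, i.e. there is at least one\n",
        "  # positive prediction, one actual positive, and one actual negative.\n",
        "  return {\n",
        "      'accuracy': (tp + tn) / (tp + fp + tn + fn),\n",
        "      'precision': tp / (tp + fp),\n",
        "      'recall': tp / (tp + fn),  # Also called the true positive rate.\n",
        "      'false_positive_rate': fp / (fp + tn),\n",
        "      'false_negative_rate': fn / (fn + tp),\n",
        "  }\n",
        "\n",
        "# Hypothetical counts, used only to demonstrate the arithmetic.\n",
        "rates = rates_from_counts(tp=40, fp=10, tn=45, fn=5)\n",
        "\n",
        "for name, value in rates.items():\n",
        "  print('{}: {:.4f}'.format(name, value))"
      ],
      "execution_count": 0,
      "outputs": []
    },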
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nsUj_XZHU_mI"
      },
      "source": [
        "### Plot the Confusion Matrix\n",
        "\n",
        "Since we've already defined which metrics we're interested in back when we defined and compiled our model, all we have to do now is:\n",
        "\n",
        "\n",
        "1.   Define a function that will help us visualize the heatmap.\n",
        "2.   Select which subgroup we're interested in, then pass that subgroup selection into `tf.keras.Model.predict()` for evaluation.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "ouE72GWSxu1j",
        "colab": {}
      },
      "source": [
        "#@title Define Function to Visualize Binary Confusion Matrix\n",
        "def plot_confusion_matrix(\n",
        "    confusion_matrix, class_names, subgroup, figsize = (8,6)):\n",
        "  # We're taking our calculated binary confusion matrix that's already in the \n",
        "  # form of an array and turning it into a pandas DataFrame because it's a lot \n",
        "  # easier to work with a pandas DataFrame when visualizing a heat map in \n",
        "  # Seaborn.\n",
        "  df_cm = pd.DataFrame(\n",
        "      confusion_matrix, index=class_names, columns=class_names, \n",
        "  )\n",
        "\n",
        "  rcParams.update({\n",
        "  'font.family':'sans-serif',\n",
        "  'font.sans-serif':['Liberation Sans'],\n",
        "  })\n",
        "  \n",
        "  sns.set_context(\"notebook\", font_scale=1.25)\n",
        "\n",
        "  fig = plt.figure(figsize=figsize)\n",
        "\n",
        "  plt.title('Confusion Matrix for Performance Across ' + subgroup)\n",
        "\n",
        "  # Combine the instance (numercial value) with its description\n",
        "  strings = np.asarray([['True Positives', 'False Negatives'],\n",
        "                        ['False Positives', 'True Negatives']])\n",
        "  labels = (np.asarray(\n",
        "      [\"{0:g}\\n{1}\".format(value, string) for string, value in zip(\n",
        "          strings.flatten(), confusion_matrix.flatten())])).reshape(2, 2)\n",
        "\n",
        "  heatmap = sns.heatmap(df_cm, annot=labels, fmt=\"\", \n",
        "      linewidths=2.0, cmap=sns.color_palette(\"GnBu_d\"));\n",
        "  heatmap.yaxis.set_ticklabels(\n",
        "      heatmap.yaxis.get_ticklabels(), rotation=0, ha='right')\n",
        "  heatmap.xaxis.set_ticklabels(\n",
        "      heatmap.xaxis.get_ticklabels(), rotation=45, ha='right')\n",
        "  plt.ylabel('References')\n",
        "  plt.xlabel('Predictions')\n",
        "  return fig"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hUvBYtwXVzlQ"
      },
      "source": [
        "Now that we have all the necessary functions defined, we can now compute the binary confusion matrix and evaluation metrics using the outcomes from [our deep neural net model](#scrollTo=3nYSMg67jWaA). The output of this cell is a tabbed view, which allows us to toggle between the confusion matrix and evaluation metrics table."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9enf_Jfi-AVS"
      },
      "source": [
        "### FairAware Task #4\n",
        "\n",
        "Use the form below to generate confusion matrices for the two gender subgroups: `Female` and `Male`. Compare the number of False Positives and False Negatives for each subgroup. Are there any significant disparities in error rates that suggest the model performs better for one subgroup than another?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "colab_type": "code",
        "id": "5TBzaWs1VKTa",
        "colab": {}
      },
      "source": [
        "#@title Visualize Binary Confusion Matrix and Compute Evaluation Metrics Per Subgroup\n",
        "CATEGORY  =  \"gender\" #@param {type:\"string\"}\n",
        "SUBGROUP =  \"Male\" #@param {type:\"string\"}\n",
        "\n",
        "# Labels for annotating axes in plot.\n",
        "classes = ['Over $50K', 'Less than $50K']\n",
        "\n",
        "# Given define subgroup, generate predictions and obtain its corresponding \n",
        "# ground truth.\n",
        "subgroup_filter  = test_df.loc[test_df[CATEGORY] == SUBGROUP]\n",
        "features, labels = pandas_to_numpy(subgroup_filter)\n",
        "subgroup_results = model.evaluate(x=features, y=labels, verbose=0)\n",
        "confusion_matrix = np.array([[subgroup_results[1], subgroup_results[4]], \n",
        "                             [subgroup_results[2], subgroup_results[3]]])\n",
        "\n",
        "subgroup_performance_metrics = {\n",
        "    'ACCURACY': subgroup_results[5],\n",
        "    'PRECISION': subgroup_results[6], \n",
        "    'RECALL': subgroup_results[7],\n",
        "    'AUC': subgroup_results[8]\n",
        "}\n",
        "performance_df = pd.DataFrame(subgroup_performance_metrics, index=[SUBGROUP])\n",
        "pd.options.display.float_format = '{:,.4f}'.format\n",
        "\n",
        "plot_confusion_matrix(confusion_matrix, classes, SUBGROUP);\n",
        "performance_df"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TF3B5h3c-7Fb"
      },
      "source": [
        "### Solution\n",
        "\n",
        "Click below for some insights we uncovered"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dhKR49AT_5ZK"
      },
      "source": [
        "Using default parameters, you may find that the model performs better for female than male. Specifically, in our run, we found that both accuracy and AUC for female (0.9137 and 0.9089, respectively) outperformed male (0.7923 and 0.8549, respectively). What is going on here?\n",
        "\n",
        "Notice the number of true positives (top-left corner) for female is way lower compared to male (479 to 3822). Recall that in Task #1 we noticed a disproportionately high representation of male in the data set (almost 2-to-1). If you further explore the data set using Facets Dive in Task #2 by setting the color to `income_bracket` and one of the axes to `gender`, then you will also find a disproportionately small number of female examples in the higher income bracket, our positive label. \n",
        "\n",
        "What this is all suggesting is that the model is **overfitting, particuarly with respect to female and lower income bracket**. In other words, this model will not generalize well, particularly with female data, as it does not have enough positive examples for the model to learn from. It is **not doing that much better with male, either, as there is a disproportionately small number of high income bracket compared to low income bracket** — though not nearly as poorly represented as with female.\n",
        "\n",
        "Hopefully going through this confusion matrix demonstration you find that the results varies slightly from the overall performance metrics, highlighting the importance of evaluating model performance across subgroup rather than in aggregate.\n",
        "\n",
        "In your work, make sure that you make a good decision about the tradeoffs between false positives, false negatives, true positives, and true negatives. For example, you may want a very low false positive rate, but a high true positive rate. Or you may want a high precision, but a low recall is okay.  \n",
        "\n",
        "**Choose your evaluation metrics in light of these desired tradeoffs.**"
      ]
    }
  ]
}
